Abstract

Estimating an unknown distribution from its samples is a fundamental problem in statistics. 
The common, min-max, formulation of this goal considers the performance of the best estimator 
over all distributions in a class. It shows that with n samples, distributions over k symbols can be 
learned to a KL divergence that decreases to zero with the sample size n, but grows unboundedly 
with the alphabet size k. 

Min-max performance can be viewed as regret relative to an oracle that knows the underlying 
distribution. We consider two natural and modest limits on the oracle’s power. One where it 
knows the underlying distribution only up to symbol permutations, and the other where it knows 
the exact distribution but is restricted to use natural estimators that assign the same probability 
to symbols that appeared equally many times in the sample. 

We show that in both cases the competitive regret reduces to $\min(k/n, \tilde{O}(1/\sqrt{n}))$, a quantity
upper bounded uniformly for every alphabet size. This shows that distributions can be estimated
nearly as well as when they are essentially known in advance, and nearly as well as when they are
completely known in advance but need to be estimated via a natural estimator. We also provide
an estimator that runs in linear time and incurs competitive regret of $\tilde{O}(\min(k/n, 1/\sqrt{n}))$, and
show that for natural estimators this competitive regret is inevitable. We also demonstrate the
effectiveness of competitive estimators using simulations.


1 Introduction 


1.1 Background 

The basic problem of learning an unknown distribution from its samples is typically formulated in 
terms of min-max performance with KL-divergence loss. Though simple and intuitive to state, its 
precise formalization requires a modicum of nomenclature. 

Let $\mathcal{P}$ be a known collection of distributions over a discrete set $\mathcal{X}$, and let $X_1, X_2, \ldots$ be samples
generated independently according to a distribution $p \in \mathcal{P}$. A distribution estimator $q$ over $\mathcal{X}$ associates with any
observed sample sequence $x^* \in \mathcal{X}^*$ a distribution $q_{x^*}$ over $\mathcal{X}$. The performance of $q$ is evaluated in
terms of a given distance measure, and we will use the popular KL divergence

$$D(p\|q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.$$

KL divergence reflects the increase in the number of bits over the entropy needed to compress the
output of $p$ using an encoding based on $q$ and the log-loss of estimating $p$ by $q$, see Cover and
Thomas [2006].
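
As a concrete illustration of the loss used throughout, the following minimal sketch computes $D(p\|q)$ for two distributions given as lists of probabilities. The function name `kl_divergence` and the use of the natural logarithm (nats rather than bits) are assumptions made for this example only.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p||q) in nats for distributions given as equal-length lists.

    Terms with p(x) = 0 contribute 0. If p(x) > 0 while q(x) = 0,
    the divergence is infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total
```

Dividing the result by `math.log(2)` converts it to bits, which matches the compression interpretation above.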

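Given $n$ samples $X^n \stackrel{\mathrm{def}}{=} X_1, X_2, \ldots, X_n$ generated independently according to $p$, the expected
loss of the estimator $q$ is

$$r_n(q, p) = \mathbb{E}\left[D(p\|q_{X^n})\right],$$

the worst-case loss of $q$ for any distribution in $\mathcal{P}$ is

$$r_n(q, \mathcal{P}) \stackrel{\mathrm{def}}{=} \max_{p \in \mathcal{P}} r_n(q, p), \qquad (1)$$

and the lowest worst-case loss for $\mathcal{P}$, achieved by the best estimator, is

$$r_n(\mathcal{P}) \stackrel{\mathrm{def}}{=} \min_{q} r_n(q, \mathcal{P}). \qquad (2)$$

Namely, $r_n(\mathcal{P})$ is the min-max loss

$$r_n(\mathcal{P}) = \min_{q} \max_{p \in \mathcal{P}} r_n(q, p).$$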

Min-max performance can be viewed as regret relative to an oracle that knows the underlying 
distribution. Thus we refer to the above quantity as regret from here on. 
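
For very small alphabets and sample sizes, the expected loss and the worst-case loss over a finite collection can be evaluated exactly by enumerating all sample sequences. The sketch below does this for the add-one (Laplace) estimator; the function names and the choice of estimator are illustrative assumptions, and the enumeration is exponential in $n$, so it is only meant to make the definitions concrete.

```python
import itertools
import math


def kl(p, q):
    # D(p||q) in nats; assumes q(x) > 0 wherever p(x) > 0.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


def laplace(sample, k):
    """Add-one estimator over symbols 0..k-1: q(x) = (n(x) + 1) / (n + k)."""
    n = len(sample)
    return [(sample.count(x) + 1) / (n + k) for x in range(k)]


def expected_loss(estimator, p, n):
    """Exact r_n(q, p) = E[D(p || q_{X^n})], summing over all k**n sequences."""
    k = len(p)
    total = 0.0
    for seq in itertools.product(range(k), repeat=n):
        prob = math.prod(p[x] for x in seq)
        if prob > 0:
            total += prob * kl(p, estimator(seq, k))
    return total


def worst_case_loss(estimator, collection, n):
    """r_n(q, P) = max over p in a finite collection P of r_n(q, p)."""
    return max(expected_loss(estimator, p, n) for p in collection)
```

For instance, `expected_loss(laplace, [0.5, 0.5], 2)` sums over the four length-2 sequences. The min over all estimators in (2) is not computed here, since it is an optimization over a function space.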

1.2 Prior work 

The most natural and important collection of distributions is the set of all distributions over 
the alphabet $\mathcal{X}$. To simplify notation, assume without loss of generality that the alphabet is
$[k] = \{1, 2, \ldots, k\}$, and then the set of all distributions is the simplex in $k$ dimensions,

$$\Delta_k \stackrel{\mathrm{def}}{=} \left\{ (p(1), \ldots, p(k)) : \forall\, 1 \le i \le k,\ p(i) \ge 0 \text{ and } \sum_{i=1}^{k} p(i) = 1 \right\}.$$

Krichevsky [1998] introduced the problem of estimating $r_n(\Delta_k)$, and Braess and Sauer [2004] showed
that as $k/n \to 0$,

$$r_n(\Delta_k) = \frac{(k-1)(1 + o(1))}{2n}. \qquad (3)$$

This result specifies the rate at which distributions in $\Delta_k$ can be approximated in KL divergence
as the number of samples increases. It also implies the upper bound of the well known $\log n$
redundancy of i.i.d. distributions, derived by Krichevsky and Trofimov [1981]. Other loss measures,
including $\ell_1$, were considered in Kamath et al. [2015], and related results appeared in Han et al.
[2014].

Motivated by natural-language processing, bioinformatics, and other modern applications, there 
has been a fair amount of recent interest in evaluating and achieving the optimal regret in the non- 
asymptotic regime, and when the sample size is not overwhelmingly larger than the alphabet size. 
For example, in English text processing, the alphabet is the English vocabulary, whose size is comparable
to the number of times a context has appeared in the corpus.

It can be shown that when the sample size $n$ is linear in the alphabet size $k$, $r_n(\Delta_k)$ is a constant,
and Paninski [2004] showed that as $k/n \to \infty$,

$$r_n(\Delta_k) = \log\frac{k}{n} + o\left(\log\frac{k}{n}\right).$$

It follows that distributions cannot be well learned with a number of samples that is comparable
to their alphabet size.

Several modifications have been proposed to address this problem. Orlitsky et al. [2003] modified 
the loss to reflect estimating the probabilities of each previously-observed symbol, and all unseen 
symbols combined. They showed that the corresponding regret can be upper bounded in terms of 
the number of samples n, regardless of the alphabet size k. McAllester and Schapire [2000], Drukh 
and Mansour [2004], Acharya et al. [2013a] estimated the combined probability of symbols that 
appeared a given number of times. Several others have restricted the collections of distributions in
$\Delta_k$ to monotone, unimodal, or log-concave distributions, see e.g., Birge [1987], Chan et al. [2013].

In this paper we address the original problem, with KL divergence regret, and the loss for
the whole collection $\Delta_k$. However, instead of considering min-max regret, we take a competitive
approach: we compare data-driven estimators with oracle-aided estimators and show that it is
possible to learn the distribution with a uniformly bounded regret.



2 Competitive formulation 

2.1 Background 

While (3) is asymptotically tight, it addresses the worst-case regret over all possible distributions
in $\Delta_k$. For smaller distribution collections in $\Delta_k$, lower regret may be achieved.

Our goal is to derive a data-driven estimator that approaches the performance of the best
estimator for any reasonable sub-collection of $\Delta_k$.

Example 1. Consider the constant-$i$ distribution over $[k]$,

$$p_i(j) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{for } j = i, \\ 0 & \text{for } j \ne i, \end{cases}$$

and the collection of all constant distributions,

$$\mathcal{P}_c \stackrel{\mathrm{def}}{=} \{p_1, \ldots, p_k\}.$$

Clearly, for all $n \ge 1$,

$$r_n(\mathcal{P}_c) = 0,$$

as the data-driven estimator $q$ that assigns probability 1 to the seen symbol $X_1$ and probability 0 to
all other $k - 1$ symbols estimates $p$ exactly.

Our goal is to derive a single estimator that simultaneously achieves essentially the lowest regret
possible for every reasonable sub-collection of $\Delta_k$.

Note that the definition of loss (regret) can be viewed as competing with a person who knows
the underlying distribution $p$ and can use any estimator $q$, naturally choosing $p$ itself.

Two simple relaxations of the problem clearly come to mind. First, competing with a person
who has only partial information about $p$. And second, competing with a person who knows $p$ but
can only use a restricted type of estimator, in particular, only estimators that would arise naturally.

We define these two modified regrets and show that under both the modified regrets, one can 
uniformly bound the regret. 

2.2 Competing with partial information 

One way to weaken an oracle-based estimator is to provide it with less information. Consider an
oracle who, instead of knowing $p \in \mathcal{P}$ exactly, has only partial knowledge of $p$. For simplicity, we
interpret partial knowledge as knowing the value of $f(p)$ for a given function $f$ over $\mathcal{P}$. Any such $f$
partitions $\mathcal{P}$ into subsets, each corresponding to one possible value of $f$, and henceforth, we will use
this equivalent formulation, namely $\mathbb{P}$ is a known partition of $\mathcal{P}$, and the oracle knows the unique
partition part $P$ such that $p \in P \in \mathbb{P}$.

For every partition part $P \in \mathbb{P}$, an estimator $q$ incurs the worst-case regret in (1),

$$r_n(q, P) = \max_{p \in P} r_n(q, p).$$

The oracle, knowing $P$, incurs the least worst-case regret (2),

$$r_n(P) = \min_{q} r_n(q, P).$$

The competitive regret of $q$ over the oracle, for all distributions in $P$, is

$$r_n(q, P) - r_n(P),$$

the competitive regret over all partition parts and all distributions in each is

$$r_n^{\mathbb{P}}(q, \mathcal{P}) \stackrel{\mathrm{def}}{=} \max_{P \in \mathbb{P}} \left( r_n(q, P) - r_n(P) \right),$$

and the best possible competitive regret is

$$r_n^{\mathbb{P}}(\mathcal{P}) \stackrel{\mathrm{def}}{=} \min_{q} r_n^{\mathbb{P}}(q, \mathcal{P}).$$

Consolidating the intermediate definitions,

$$r_n^{\mathbb{P}}(\mathcal{P}) = \min_{q} \max_{P \in \mathbb{P}} \left( \max_{p \in P} r_n(q, p) - r_n(P) \right).$$

Namely, an oracle-aided estimator who knows the partition part incurs a worst-case regret $r_n(P)$
over each part $P$, and the competitive regret $r_n^{\mathbb{P}}(\mathcal{P})$ of data-driven estimators is the least overall
increase in the part-wise regret due to not knowing $P$. The following examples evaluate $r_n^{\mathbb{P}}(\mathcal{P})$ for
the two simplest partitions of any collection $\mathcal{P}$.

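Example 2. The singleton partition consists of $|\mathcal{P}|$ parts, each a single distribution in $\mathcal{P}$,

$$\mathbb{P}_{|\mathcal{P}|} \stackrel{\mathrm{def}}{=} \{\{p\} : p \in \mathcal{P}\}.$$

An oracle-aided estimator that knows the part containing $p$ knows $p$. The competitive regret of
data-driven estimators is therefore the min-max regret,

$$\begin{aligned}
r_n^{\mathbb{P}_{|\mathcal{P}|}}(\mathcal{P}) &= \min_{q} \max_{p \in \mathcal{P}} \left( r_n(q, \{p\}) - r_n(\{p\}) \right) \\
&= \min_{q} \max_{p \in \mathcal{P}} r_n(q, p) \\
&= r_n(\mathcal{P}),
\end{aligned}$$

where the middle equality follows as $r_n(q, \{p\}) = r_n(q, p)$, and $r_n(\{p\}) = 0$.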

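Example 3. The whole-collection partition has only one part, the whole collection $\mathcal{P}$,

$$\mathbb{P}_1 \stackrel{\mathrm{def}}{=} \{\mathcal{P}\}.$$

An estimator aided by an oracle that knows the part containing $p$ has no additional information,
hence no advantage over a data-driven estimator, and the competitive regret is 0,

$$\begin{aligned}
r_n^{\mathbb{P}_1}(\mathcal{P}) &= \min_{q} \max_{P \in \{\mathcal{P}\}} \left( \max_{p \in P} r_n(q, p) - r_n(P) \right) \\
&= \min_{q} \left( \max_{p \in \mathcal{P}} r_n(q, p) - r_n(\mathcal{P}) \right) \\
&= \min_{q} \max_{p \in \mathcal{P}} r_n(q, p) - r_n(\mathcal{P}) \\
&= r_n(\mathcal{P}) - r_n(\mathcal{P}) \\
&= 0.
\end{aligned}$$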


The examples show that for the coarsest partition of $\mathcal{P}$, into a single part, the competitive
regret is the lowest possible, 0, while for the finest partition, into singletons, the competitive regret
is the highest possible, $r_n(\mathcal{P})$.

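A partition $\mathbb{P}'$ refines a partition $\mathbb{P}$ if every part in $\mathbb{P}$ is partitioned by some parts in $\mathbb{P}'$. For
example, $\{\{a, b\}, \{c\}, \{d, e\}\}$ refines $\{\{a, b, c\}, \{d, e\}\}$. It is easy to see that if $\mathbb{P}'$ refines $\mathbb{P}$ then for
every $q$,

$$r_n^{\mathbb{P}'}(q, \mathcal{P}) \ge r_n^{\mathbb{P}}(q, \mathcal{P}). \qquad (4)$$

The definition implies that if $P' \subseteq P$ then $r_n(P') \le r_n(P)$, hence for every $q$,

$$\begin{aligned}
r_n^{\mathbb{P}'}(q, \mathcal{P}) &= \max_{P' \in \mathbb{P}'} \left( r_n(q, P') - r_n(P') \right) \\
&\ge \max_{P \in \mathbb{P}} \left( \max_{P \supseteq P' \in \mathbb{P}'} r_n(q, P') - r_n(P) \right) \\
&= \max_{P \in \mathbb{P}} \left( r_n(q, P) - r_n(P) \right) \\
&= r_n^{\mathbb{P}}(q, \mathcal{P}).
\end{aligned}$$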
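
The refinement relation itself is easy to check mechanically. The sketch below represents a partition as a list of Python sets and tests the relation on the example given above; the function name `refines` is an assumption for illustration, and both arguments are assumed to be partitions of the same ground set.

```python
def refines(fine, coarse):
    """Return True if every part of `fine` lies inside some part of `coarse`.

    Both arguments are lists of sets and are assumed to partition the same
    ground set, so containment of parts is equivalent to refinement.
    """
    return all(any(part <= big for big in coarse) for part in fine)


fine = [{"a", "b"}, {"c"}, {"d", "e"}]
coarse = [{"a", "b", "c"}, {"d", "e"}]
```

Here `refines(fine, coarse)` holds while `refines(coarse, fine)` does not, matching the direction of inequality (4): the finer partition gives the oracle more information and hence a larger competitive regret.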


Note that this notion of competitiveness has appeared in several contexts. In data compression
it is called twice-redundancy Ryabko [1984, 1990], Bontemps et al. [2014], Boucheron et al. [2014],
while in statistics it is often called adaptive or local min-max, see e.g., Donoho and Johnstone [1994],
Abramovich et al. [2006], Bickel et al. [1993], Barron et al. [1999], Tsybakov [2004], and recently
in property testing it is referred to as competitive Acharya et al. [2011, 2012, 2013b] or
instance-by-instance Valiant and Valiant [2014].

Permutation class 

Considering the collection of all distributions over $[k]$, it follows that as we start with the single-part
partition $\{\Delta_k\}$ and keep refining it till the oracle knows $p$, the competitive regret of estimators will
increase from 0 to $r_n(\Delta_k)$.

A natural question is therefore how much information the oracle can have and still keep the
competitive regret low. We show that the oracle can know the distribution exactly up to permutation,
and still the relative regret will be very small.

Call two distributions $p$ and $p'$ over $[k]$ permutation equivalent if there is a permutation $\sigma$ of $[k]$
such that

$$p'_{\sigma(i)} = p_i \quad \text{for all } i \in [k];$$

for example, over $[3]$, $(0.5, 0.3, 0.2)$ and $(0.3, 0.5, 0.2)$ are permutation equivalent. Permutation
equivalence is clearly an equivalence relation, and hence partitions the collection of distributions over
$[k]$ into equivalence classes. Let $\mathbb{P}_\sigma$ be the corresponding partition. We construct estimators $q$ that
uniformly bound $r_n^{\mathbb{P}_\sigma}(q, \Delta_k)$; thus the same estimator uniformly bounds $r_n^{\mathbb{P}}(q, \Delta_k)$ for any coarser
partition $\mathbb{P}$ of $\Delta_k$, such as partitions in which each class contains distributions with the same entropy or
the same sparse support.
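
Two distributions are permutation equivalent exactly when their probability values agree as multisets, so the relation can be tested by comparing sorted probability vectors. The following sketch does this; the function name and the floating-point tolerance `tol` are assumptions introduced for the example.

```python
def permutation_equivalent(p, p_prime, tol=1e-12):
    """Check whether p_prime is a relabeling of p.

    Both arguments are sequences of probabilities over [k]. They are
    permutation equivalent iff their sorted probability vectors coincide.
    """
    if len(p) != len(p_prime):
        return False
    return all(abs(a - b) <= tol for a, b in zip(sorted(p), sorted(p_prime)))
```

The sorted vector serves as a canonical label for the equivalence class, which is precisely the information held by the oracle that knows the distribution up to permutation.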

2.3 Competing with natural estimators 

Another restriction on the oracle-aided estimator is to still let it know $p$ exactly, but force it to be
"natural", namely, to assign the same probability to all symbols that appeared the same number
of times in the sample. For example, for the observed sample a, b, c, a, b, d, e, to assign the same
probability to a and b, and the same probability to c, d, and e.

Since data-driven estimators derive all their knowledge of the distribution from the data, we
expect them to be natural. We also saw in the previous section that natural estimators are optimal
for the permutation-invariant oracle.
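
The naturalness condition can be checked directly from the sample counts. The sketch below groups symbols by the number of times they appear and verifies that an estimate assigns a single value within each group; the function name `is_natural`, the dictionary representation of the estimate, and the tolerance are assumptions for illustration.

```python
from collections import Counter


def is_natural(sample, q, alphabet, tol=1e-12):
    """Return True if q gives equal probability to symbols with equal counts.

    `sample` is a sequence of symbols, `q` maps each symbol of `alphabet`
    to its estimated probability. Unseen symbols form the count-0 group.
    """
    counts = Counter(sample)
    values_by_count = {}
    for x in alphabet:
        values_by_count.setdefault(counts[x], []).append(q[x])
    return all(max(v) - min(v) <= tol for v in values_by_count.values())
```

For the sample a, b, c, a, b, d, e, the empirical frequencies form a natural estimate, whereas any estimate that separates a from b, or c from d or e, does not.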

We now compare the regret of data-driven estimators to that of natural oracle-aided estimators.
For a distribution $p$, the lowest regret of natural estimators is

$$r_n^{\mathrm{nat}}(p) \stackrel{\mathrm{def}}{=} \min_{q \in \mathcal{Q}^{\mathrm{nat}}} r_n(q, p),$$

where $\mathcal{Q}^{\mathrm{nat}}$ is the set of all natural estimators. The regret of an estimator $q$ relative to the best
natural estimator designed with knowledge of $p$ is

$$r_n^{\mathrm{nat}}(q, p) = r_n(q, p) - r_n^{\mathrm{nat}}(p).$$

The regret of data-driven estimators over $\mathcal{P}$ is therefore

$$r_n^{\mathrm{nat}}(\mathcal{P}) = \min_{q} \max_{p \in \mathcal{P}} r_n^{\mathrm{nat}}(q, p).$$

We show that indeed for the class of discrete distributions over support $[k]$, i.e., $\Delta_k$, $r_n^{\mathrm{nat}}(\Delta_k)$ is
uniformly bounded.

The rest of the paper is organized as follows. In Section 3, we state our results and in Section 4, 
provide the proofs. In Section 5, we compare the competitive estimator to other min-max motivated 
estimators using experiments. 


3 Results 


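We show that the estimator proposed in Acharya et al. [2013a] (call it $q'$) is competitive if partial
information is known or if we restrict the class of estimators to be natural. In Theorem 9, we prove

$$r_n^{\mathbb{P}_\sigma}(q', \Delta_k) \le r_n^{\mathrm{nat}}(q', \Delta_k) \le \tilde{O}\left(\min\left(\frac{1}{\sqrt{n}}, \frac{k}{n}\right)\right). \qquad (5)$$

Thus for any coarser partition $\mathbb{P}$, the same result holds. Here $\tilde{O}$ and later $\tilde{\Omega}$ hide multiplicative
logarithmic factors. Together with Lemma 8, the lower bounds in Acharya et al. [2013a] can be
extended to show that

$$r_n^{\mathrm{nat}}(\Delta_k) \ge \tilde{\Omega}\left(\min\left(\frac{1}{\sqrt{n}}, \frac{k}{n}\right)\right).$$

Thus the performance of the estimator proposed in Acharya et al. [2013a] is nearly optimal
compared to the class of natural estimators. Equation (5) immediately implies
$r_n^{\mathbb{P}_\sigma}(\Delta_k) \le r_n^{\mathrm{nat}}(\Delta_k) \le \tilde{O}\left(\min\left(\frac{1}{\sqrt{n}}, \frac{k}{n}\right)\right)$. However, Equation (3) and the fact that the min-max estimator proposed
in Braess and Sauer [2004] is natural imply

$$\begin{aligned}
r_n^{\mathrm{nat}}(\Delta_k) &= \min_{q} \max_{p \in \Delta_k} \left( \mathbb{E}[D(p\|q)] - r_n^{\mathrm{nat}}(p) \right) \\
&\le \min_{q} \max_{p \in \Delta_k} \mathbb{E}[D(p\|q)] \\
&= \frac{(k-1)(1 + o(1))}{2n}.
\end{aligned}$$

The above equation together with Equation (5) implies a stronger bound:

$$r_n^{\mathbb{P}_\sigma}(\Delta_k) \le r_n^{\mathrm{nat}}(\Delta_k) \le \min\left( \tilde{O}\left(\frac{1}{\sqrt{n}}\right), \frac{(k-1)(1 + o(1))}{2n} \right).$$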


4 Proofs 

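The proof consists of two parts. We first show that for every estimator $q$, $r_n^{\mathbb{P}_\sigma}(q, \Delta_k) \le r_n^{\mathrm{nat}}(q, \Delta_k)$,
and then upper bound $r_n^{\mathrm{nat}}(q, \Delta_k)$ using results on combined probability mass.

4.1 Relation between $r_n^{\mathbb{P}_\sigma}(q, \Delta_k)$ and $r_n^{\mathrm{nat}}(q, \Delta_k)$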

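We now show an auxiliary result that helps us relate $r_n^{\mathrm{nat}}(q, \Delta_k)$ to $r_n^{\mathbb{P}_\sigma}(q, \Delta_k)$. For a symbol $x$, let
$n(x)$ denote the number of times it appears in the sequence.

Lemma 4. For every class $P \in \mathbb{P}_\sigma$,

$$r_n(P) \ge \max_{p \in P} r_n^{\mathrm{nat}}(p).$$

Proof. We first show that there is an optimal estimator $q$ that is natural. In
particular, let

$$q''_{x^n}(y) \stackrel{\mathrm{def}}{=} \frac{\sum_{\sigma'' \in S_k} p(\sigma''(x^n y))}{\sum_{\sigma' \in S_k} p(\sigma'(x^n))},$$

where $S_k$ is the set of all permutations of $k$ symbols and $p(\sigma(x^n))$ denotes the probability under $p$ of the
relabeled sequence $\sigma(x_1), \ldots, \sigma(x_n)$. We show that $q''$ is an optimal estimator
for $P$. Since $q''_{\sigma(x^n)}(\sigma(y)) = q''_{x^n}(y)$ for any permutation $\sigma$, the estimator achieves the same loss
for every $p \in P$,

$$\max_{p \in P} r_n(q'', p) = \frac{1}{k!} \sum_{\sigma \in S_k} r_n(q'', p(\sigma(\cdot))). \qquad (6)$$

For any estimator $q$,

$$\begin{aligned}
\max_{p \in P} \mathbb{E}[D(p\|q)]
&\stackrel{(a)}{\ge} \frac{1}{k!} \sum_{\sigma \in S_k} \mathbb{E}[D(p(\sigma(\cdot))\|q)] \\
&\stackrel{(b)}{=} \frac{1}{k!} \sum_{\sigma \in S_k} \sum_{x^n \in \mathcal{X}^n} \sum_{y \in \mathcal{X}} p(\sigma(x^n y)) \log \frac{1}{q_{x^n}(y)} - H(p) \\
&= \frac{1}{k!} \sum_{x^n \in \mathcal{X}^n} \sum_{\sigma \in S_k} \sum_{y \in \mathcal{X}} p(\sigma(x^n y)) \log \frac{1}{q_{x^n}(y)} - H(p) \\
&\stackrel{(c)}{\ge} \frac{1}{k!} \sum_{\sigma \in S_k} \sum_{x^n \in \mathcal{X}^n} \sum_{y \in \mathcal{X}} p(\sigma(x^n y)) \log \frac{1}{q''_{x^n}(y)} - H(p) \\
&\stackrel{(d)}{=} \frac{1}{k!} \sum_{\sigma \in S_k} r_n(q'', p(\sigma(\cdot))).
\end{aligned}$$

(a) follows from the fact that the maximum is larger than the average. (b) follows from the fact that
every distribution in $P$ has the same entropy. Non-negativity of KL divergence implies (c). All
distributions in $P$ have the same entropy and hence (d). Hence together with Equation (6),

$$\begin{aligned}
r_n(P) &= \min_{q} \max_{p \in P} \mathbb{E}[D(p\|q)] \\
&\ge \frac{1}{k!} \sum_{\sigma \in S_k} r_n(q'', p(\sigma(\cdot))) \\
&= \max_{p \in P} r_n(q'', p).
\end{aligned}$$

Hence $q''$ is an optimal estimator. $q''$ is natural as if $n(y) = n(y')$, then $q''_{x^n}(y) = q''_{x^n}(y')$.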







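Since there is a natural estimator that achieves the minimum in $r_n(P)$,

$$\begin{aligned}
r_n(P) &= \min_{q} \max_{p \in P} \mathbb{E}[D(p\|q)] \\
&= \min_{q \in \mathcal{Q}^{\mathrm{nat}}} \max_{p \in P} \mathbb{E}[D(p\|q)] \\
&\ge \max_{p \in P} \min_{q \in \mathcal{Q}^{\mathrm{nat}}} \mathbb{E}[D(p\|q)] \\
&= \max_{p \in P} r_n^{\mathrm{nat}}(p),
\end{aligned}$$

where the last inequality follows from the fact that min-max is bigger than max-min. □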
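
The symmetrization idea in this proof can be explored numerically on tiny alphabets. The sketch below assumes that the symmetrized estimator is the predictive distribution obtained by averaging over all relabelings of a known $p$, that is, the ratio of the permutation-summed probability of the extended sequence to that of the observed sequence. This form is an assumption of the example, not a statement of the paper's exact construction, and the function name is hypothetical. The enumeration costs $k!$ and is only feasible for very small $k$.

```python
import itertools
import math


def symmetrized_estimator(sample, p):
    """Predictive estimate averaged over all relabelings of p.

    `sample` is a sequence of symbols in 0..k-1 and `p` a list of k
    probabilities. Returns a list q with q[y] proportional to the sum over
    permutations sigma of the probability of sigma(sample followed by y).
    """
    sample = tuple(sample)
    k = len(p)
    perms = list(itertools.permutations(range(k)))

    def seq_prob(seq, sigma):
        # Probability under p of the relabeled sequence sigma(seq).
        return math.prod(p[sigma[x]] for x in seq)

    denom = sum(seq_prob(sample, s) for s in perms)
    return [
        sum(seq_prob(sample + (y,), s) for s in perms) / denom
        for y in range(k)
    ]
```

Under this assumed form the output sums to one, and symbols with equal counts receive equal probability, which is the naturalness property used in the proof.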

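We now relate $r_n^{\mathbb{P}_\sigma}(q, \Delta_k)$ to $r_n^{\mathrm{nat}}(q, \Delta_k)$.

Lemma 5. For every estimator $q$,

$$r_n^{\mathbb{P}_\sigma}(q, \Delta_k) \le r_n^{\mathrm{nat}}(q, \Delta_k).$$

Proof.

$$\begin{aligned}
r_n^{\mathbb{P}_\sigma}(q, \Delta_k)
&= \max_{P \in \mathbb{P}_\sigma} \left( \max_{p \in P} \mathbb{E}[D(p\|q)] - r_n(P) \right) \\
&\stackrel{(a)}{\le} \max_{P \in \mathbb{P}_\sigma} \left( \max_{p \in P} \mathbb{E}[D(p\|q)] - \max_{p \in P} r_n^{\mathrm{nat}}(p) \right) \\
&\stackrel{(b)}{\le} \max_{P \in \mathbb{P}_\sigma} \max_{p \in P} \left( \mathbb{E}[D(p\|q)] - r_n^{\mathrm{nat}}(p) \right) \\
&= \max_{p \in \Delta_k} \left( \mathbb{E}[D(p\|q)] - r_n^{\mathrm{nat}}(p) \right) \\
&= r_n^{\mathrm{nat}}(q, \Delta_k).
\end{aligned}$$

(a) follows from Lemma 4. Difference of maximums is smaller than maximum of differences, hence
(b). □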


4.2 Relation between $r_n^{\mathrm{nat}}(q, \Delta_k)$ and combined probability estimation

We now relate the regret in estimating a distribution to that of estimating the combined or total
probability mass, defined as follows. For a sequence $x^n$, let $\varphi(t)$ denote the number of symbols
appearing $t$ times and $S(t)$ denote the total probability of symbols appearing $t$ times. Similar to
KL divergence between distributions, we define KL divergence between $S$ and its estimate $\hat{S}$ as

$$D(S\|\hat{S}) = \sum_{t=0}^{n} S(t) \log \frac{S(t)}{\hat{S}(t)}.$$
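
To make these quantities concrete, the sketch below computes the prevalences $\varphi(t)$ and the combined masses $S(t)$ for a given sample and a known distribution, along with the divergence between $S$ and an estimate of it. The function names and the dictionary representation (symbol to probability, count to value) are assumptions for this illustration.

```python
import math
from collections import Counter, defaultdict


def prevalences_and_mass(sample, p):
    """Return (phi, S) for a sample and a distribution p given as a dict.

    phi[t] is the number of symbols that appear exactly t times in the
    sample, and S[t] is the total probability under p of those symbols.
    Symbols of p that never appear contribute to t = 0.
    """
    counts = Counter(sample)
    phi = defaultdict(int)
    S = defaultdict(float)
    for x, px in p.items():
        t = counts[x]
        phi[t] += 1
        S[t] += px
    return dict(phi), dict(S)


def combined_kl(S, S_hat):
    """D(S || S_hat) = sum over t of S(t) log(S(t) / S_hat(t)), in nats."""
    return sum(s * math.log(s / S_hat[t]) for t, s in S.items() if s > 0)
```

For example, with `p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}` and the sample `"aab"`, the two unseen symbols are grouped at `t = 0` and their probabilities are added together.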

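We find the best natural estimator, which achieves $r_n^{\mathrm{nat}}(p)$, in the next lemma.

Lemma 6. Let $q^*_{x^n}(x) = \frac{S(n(x))}{\varphi(n(x))}$, then

$$q^* = \arg\min_{q \in \mathcal{Q}^{\mathrm{nat}}} r_n(q, p)$$

and

$$r_n^{\mathrm{nat}}(p) = \mathbb{E}\left[ \sum_{t=0}^{n} S(t) \log \frac{\varphi(t)}{S(t)} \right] - H(p).$$

Proof. For a sequence $x^n$ and a natural estimator $q$, write $q_{x^n}(t)$ for the common probability that $q$
assigns to each symbol appearing $t$ times. Then

$$\begin{aligned}
\sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q_{x^n}(x)} - \sum_{t=0}^{n} S(t) \log \frac{\varphi(t)}{S(t)}
&= \sum_{t=0}^{n} \sum_{x : n(x) = t} p(x) \log \frac{1}{q_{x^n}(x)} - \sum_{t=0}^{n} S(t) \log \frac{\varphi(t)}{S(t)} \\
&= \sum_{t=0}^{n} S(t) \log \frac{1}{q_{x^n}(t)} - \sum_{t=0}^{n} S(t) \log \frac{\varphi(t)}{S(t)} \\
&= \sum_{t=0}^{n} S(t) \log \frac{S(t)}{\varphi(t)\, q_{x^n}(t)} \\
&\ge 0,
\end{aligned}$$

where the last inequality follows from the fact that $\sum_t S(t) = \sum_t \varphi(t)\, q_{x^n}(t) = 1$ and KL divergence
is non-negative. Furthermore, equality in the last inequality is achieved only by the estimator that assigns
$q^*_{x^n}(x) = \frac{S(n(x))}{\varphi(n(x))}$. Hence,

$$\begin{aligned}
r_n^{\mathrm{nat}}(p) &= \min_{q \in \mathcal{Q}^{\mathrm{nat}}} \mathbb{E}\left[ \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q_{x^n}(x)} \right] \\
&= -H(p) + \mathbb{E}\left[ \sum_{t=0}^{n} S(t) \log \frac{\varphi(t)}{S(t)} \right].
\end{aligned}$$

□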


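Since a natural estimator assigns the same probability to symbols that appear the same number
of times, estimating probabilities is the same as estimating the total probability of symbols appearing
a given number of times. We formalize it in the next lemma.

Lemma 7. For a natural estimator $q$ let $\hat{S}(t) = \sum_{x : n(x) = t} q(x)$, then

$$r_n^{\mathrm{nat}}(q, p) = \mathbb{E}[D(S\|\hat{S})].$$

Proof. For any natural estimator $q$ and sequence $x^n$,

$$\begin{aligned}
\sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q_{x^n}(x)}
&= \sum_{t=0}^{n} \sum_{x : n(x) = t} p(x) \log \frac{1}{q_{x^n}(x)} \\
&= \sum_{t=0}^{n} S(t) \log \frac{\varphi(t)}{\hat{S}(t)} \\
&= \sum_{t=0}^{n} S(t) \log \frac{S(t)}{\hat{S}(t)} + \sum_{t=0}^{n} S(t) \log \frac{\varphi(t)}{S(t)}.
\end{aligned}$$

Thus by Lemma 6,

$$\begin{aligned}
r_n^{\mathrm{nat}}(q, p) &= r_n(q, p) - r_n^{\mathrm{nat}}(p) \\
&= -H(p) + \mathbb{E}\left[ \sum_{t=0}^{n} S(t) \log \frac{S(t)}{\hat{S}(t)} + \sum_{t=0}^{n} S(t) \log \frac{\varphi(t)}{S(t)} \right] + H(p) - \mathbb{E}\left[ \sum_{t=0}^{n} S(t) \log \frac{\varphi(t)}{S(t)} \right] \\
&= \mathbb{E}[D(S\|\hat{S})].
\end{aligned}$$

□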
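
The identity behind Lemmas 6 and 7 already holds sample by sample: for a natural estimate $q$, the excess log-loss of $q$ over the oracle natural estimate $q^*(x) = S(n(x))/\varphi(n(x))$ equals $D(S\|\hat{S})$. The sketch below checks this numerically on one sample, using the add-one estimator as the natural estimate; all function names and the test distribution are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def best_natural(sample, p):
    """Oracle natural estimate q*(x) = S(n(x)) / phi(n(x)) for known p (a dict)."""
    counts = Counter(sample)
    phi, S = defaultdict(int), defaultdict(float)
    for x, px in p.items():
        phi[counts[x]] += 1
        S[counts[x]] += px
    return {x: S[counts[x]] / phi[counts[x]] for x in p}


def laplace(sample, p):
    """Add-one estimate over the support of p; it is a natural estimate."""
    counts = Counter(sample)
    total = len(sample) + len(p)
    return {x: (counts[x] + 1) / total for x in p}


def cross_entropy(p, q):
    return sum(px * math.log(1.0 / q[x]) for x, px in p.items() if px > 0)


def lemma7_gap(sample, p, q):
    """Return (excess log-loss of q over q*, D(S || S_hat)) for one sample."""
    counts = Counter(sample)
    S, S_hat = defaultdict(float), defaultdict(float)
    for x, px in p.items():
        S[counts[x]] += px
        S_hat[counts[x]] += q[x]
    excess = cross_entropy(p, q) - cross_entropy(p, best_natural(sample, p))
    d = sum(s * math.log(s / S_hat[t]) for t, s in S.items() if s > 0)
    return excess, d
```

Taking expectations over samples of both returned quantities gives the two sides of Lemma 7. The check is only valid when `q` is natural, since otherwise the grouping by count loses information.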


Taking maximum over all distributions $p$ and minimum over all estimators $q$ results in

Lemma 8. For a natural estimator $q$ let $\hat{S}(t) = \sum_{x : n(x) = t} q(x)$, then

$$r_n^{\mathrm{nat}}(q, \Delta_k) = \max_{p \in \Delta_k} \mathbb{E}[D(S\|\hat{S})].$$

Furthermore,

$$r_n^{\mathrm{nat}}(\Delta_k) = \min_{\hat{S}} \max_{p \in \Delta_k} \mathbb{E}[D(S\|\hat{S})].$$

Thus finding the best competitive natural estimator is the same as finding the best estimator for
the combined probability mass $S$. Acharya et al. [2013a] proposed an algorithm for estimating $S$
such that for all $k$, with probability $\ge 1 - 1/n$,

$$\max_{p \in \Delta_k} D(S\|\hat{S}) = \tilde{O}\left(\frac{1}{\sqrt{n}}\right).$$

The result is stated in Theorem 2 of Acharya et al. [2013a]. One can easily convert this result to a result
in expectation using the property that their estimator is bounded below by $1/(2n)$, and show
that

$$\max_{p \in \Delta_k} \mathbb{E}[D(S\|\hat{S})] = \tilde{O}\left(\frac{1}{\sqrt{n}}\right).$$

A slight modification of their proof for Lemma 17 and Theorem 2 in their paper, using
$\sqrt{\varphi(t)} \le \sum_{t'=1}^{n} \varphi(t')$, shows that their estimator $\hat{S}$ for the combined probability mass $S$ satisfies

$$\max_{p \in \Delta_k} \mathbb{E}[D(S\|\hat{S})] = \tilde{O}\left(\min\left(\frac{1}{\sqrt{n}}, \frac{k}{n}\right)\right).$$

The above equation together with Lemmas 5 and 8 shows the following.

Theorem 9. For any $k$ and $n$, the proposed estimator $q$ in Acharya et al. [2013a] satisfies

$$r_n^{\mathbb{P}_\sigma}(q, \Delta_k) \le r_n^{\mathrm{nat}}(q, \Delta_k) \le \tilde{O}\left(\min\left(\frac{1}{\sqrt{n}}, \frac{k}{n}\right)\right).$$

5 Experiments


For small values of $n$ and $k$, the estimator proposed in Acharya et al. [2013a] behaves as a
combination of Good-Turing and empirical estimators. Hence for experiments, we use the following
combination of Good-Turing and empirical estimators,

$$q(x) = \begin{cases}
\dfrac{n(x)}{N} & \text{if } n(x) > \varphi(n(x) + 1), \\[2mm]
\dfrac{\max(\varphi(n(x) + 1), 1)}{\varphi(n(x))} \cdot \dfrac{n(x) + 1}{N} & \text{else,}
\end{cases}$$

where $N$ is the normalization factor to ensure that the probabilities add to 1. We compare the
above competitive estimator with several popular add-$\beta$ estimators $\hat{S}$ of the form

$$\hat{S}(t) = \varphi(t) \cdot \frac{t + \beta^{\hat{S}}(t)}{N(\hat{S})},$$

that is, each symbol appearing $t$ times is assigned probability $(t + \beta^{\hat{S}}(t)) / N(\hat{S})$,
where $N(\hat{S})$ is a normalization factor to ensure that the probabilities add up to 1.

The Laplace estimator has $\beta^{\mathrm{L}}(t) = 1\ \forall t$. It is optimal when the underlying distribution is
generated from the uniform prior on $\Delta_k$. The Krichevsky-Trofimov estimator has $\beta^{\mathrm{KT}}(t) = 1/2\ \forall t$
and is min-max optimal for the cumulative regret or when the underlying distribution is generated
from a Dirichlet-1/2 prior. The Braess-Sauer estimator has $\beta^{\mathrm{BS}}(0) = 1/2$, $\beta^{\mathrm{BS}}(1) = 1$, $\beta^{\mathrm{BS}}(t) =
3/4\ \forall t > 1$ and is min-max optimal for $r_n(\Delta_k)$. We also compare against the best regret by any
natural estimator, which simply estimates $p(x)$ by $q(x) = \frac{S(n(x))}{\varphi(n(x))}$ (see Lemma 6).
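
The estimators compared in this section can be sketched in a few lines. The code below implements the Good-Turing and empirical combination and the add-$\beta$ family as they are described above; function names, the dictionary output format, and the explicit `alphabet` argument are assumptions for illustration, and this is not the implementation used to produce the reported results.

```python
from collections import Counter


def proposed_estimator(sample, alphabet):
    """Combination of empirical and Good-Turing estimates, then normalized.

    A symbol seen t times keeps weight t if t > phi(t + 1), and otherwise
    gets the Good-Turing style weight max(phi(t + 1), 1) * (t + 1) / phi(t).
    """
    counts = Counter(sample)
    phi = Counter(counts[x] for x in alphabet)  # phi[t]: symbols seen t times
    raw = {}
    for x in alphabet:
        t = counts[x]
        if t > phi[t + 1]:
            raw[x] = float(t)
        else:
            raw[x] = max(phi[t + 1], 1) * (t + 1) / phi[t]
    norm = sum(raw.values())
    return {x: w / norm for x, w in raw.items()}


def add_beta_estimator(sample, alphabet, beta):
    """Add-beta estimate: weight n(x) + beta(n(x)), then normalized."""
    counts = Counter(sample)
    raw = {x: counts[x] + beta(counts[x]) for x in alphabet}
    norm = sum(raw.values())
    return {x: w / norm for x, w in raw.items()}


def laplace_beta(t):
    return 1.0


def krichevsky_trofimov_beta(t):
    return 0.5


def braess_sauer_beta(t):
    return 0.5 if t == 0 else (1.0 if t == 1 else 0.75)
```

As a small usage example, `proposed_estimator("aab", "abcd")` gives weight 2 to both `a` and `b` and weight 1/2 to each unseen symbol before normalization. All of these estimates are natural, since the weight of a symbol depends only on its count and on the prevalences.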

We compare the above five estimators for six distributions with support k = 10000 and number 
of samples n < 50000. All results are averaged over 200 trials. 

The six distributions are the uniform distribution, a step distribution with half the symbols having
probability $1/(2k)$ and the other half having probability $3/(2k)$, a Zipf distribution with parameter 1
($p(i) \propto i^{-1}$), a Zipf distribution with parameter 1.5 ($p(i) \propto i^{-1.5}$), a distribution generated by the
uniform prior on $\Delta_k$, and a distribution generated from a Dirichlet-1/2 prior.
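
The six test distributions can be generated as in the following sketch. It relies on two standard facts: the uniform prior on the simplex is the Dirichlet distribution with all parameters equal to 1, and a Dirichlet sample can be drawn by normalizing independent Gamma variates. Function names and the use of Python's `random` module are assumptions for illustration.

```python
import random


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def uniform(k):
    return [1.0 / k] * k


def step(k):
    """Half the symbols get weight 1/(2k), the other half 3/(2k)."""
    half = k // 2
    return normalize([1.0 / (2 * k)] * half + [3.0 / (2 * k)] * (k - half))


def zipf(k, s):
    """Zipf distribution with parameter s: p(i) proportional to i**(-s)."""
    return normalize([i ** (-s) for i in range(1, k + 1)])


def dirichlet(k, alpha, rng):
    """Draw from a symmetric Dirichlet(alpha) prior via Gamma variates."""
    return normalize([rng.gammavariate(alpha, 1.0) for _ in range(k)])


def six_distributions(k, rng):
    return {
        "uniform": uniform(k),
        "step": step(k),
        "zipf_1": zipf(k, 1.0),
        "zipf_1.5": zipf(k, 1.5),
        "uniform_prior": dirichlet(k, 1.0, rng),
        "dirichlet_half": dirichlet(k, 0.5, rng),
    }
```

Passing an explicit generator such as `random.Random(0)` keeps the two randomly drawn distributions reproducible across runs.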

The results are given in Figure 1. The proposed estimator uniformly performs well for all the 
six distributions and is close to what the best natural estimator can achieve. Furthermore for Zipf, 
uniform, and step distributions the performance is significantly better. 

The performance of the other estimators depends on the underlying distribution. For example, since
Laplace is the optimal estimator when the underlying distribution is generated from the uniform
prior, it performs well in Figure 1(e), but performs poorly on other distributions.

Furthermore, even though for distributions generated by Dirichlet priors, all the estimators have 
similar looking regrets (Figures 1(e), 1(f)), the proposed estimator performs better than estimators 
which are not designed specifically for that prior. 
Figure 1: Simulation results for support 10000, number of samples ranging from 1000 to 50000,
averaged over 200 trials. Panels: (a) Uniform, (b) Step.